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Abstract 

o 

We provide evidence that a root-mean-square test of goodness-of-fit can be significantly more powerful 
than state-of-the-art exact tests in detecting deviations from Hardy-Weinberg equilibrium. Unlike Pear- 
son's chi-square test, the log-likelihood-ratio test, and Fisher's exact test, which are sensitive to relative 
discrepancies between genotypic frequencies, the root-mean-square test is sensitive to absolute discrep- 
p-^ ancies. This can increase statistical power, as we demonstrate using benchmark datasets and through 

asymptotic analysis. With the aid of computers, exact P-values for the root-mean-square statistic can 
j be calculated effortlessly, and can be easily implemented using the author's freely available code. 
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1 Introduction 



^ In 1908, G. H. Hardy [Har08) and W. Weinberg jWeiOS) independently derived mathematical equations 

to corroborate the theory of Mendelian inheritance, proving that in a large population of individuals 
subject to random mating, the proportions of alleles and genotypes at a locus stay unchanged unless 
specific disturbing influences are introduced. Today, Hardy-Weinberg equilibrium (HWE) is a common 
hypothesis used in scientific domains ranging from botany [Wei05] to forensic science [Cou96| and ge- 
netic epidemiology [ShaOll rKLB04| . Statistical tests of deviation from Hardy-Weinberg equilibrium are 
04 fundamental for validating such assumptions. Traditionally, Pearson's chi-square goodness-of-fit test, or 

" . l an asymptotically-equivalent variant such as the log-likelihood-ratio test, was used for this assessment. 

^ ^ Before computers became readily available, the asymptotic chi-square approximation for the statistics 

used in these tests, however poor, was the only practical means for drawing inference. With the now 
widespread availability of computers, exact tests can be computed effortlessly, opening the door to more 
powerful goodness-of-fit tests. In their seminal paper [GT92] . Guo and Thompson campaigned for an 
exact test of HWE based on the likelihood function. While their work renewed interest in conditional 
exact tests for Hardy-Weinberg equilibrium |RR95I IDS981 IWC A05) . likelihood-based tests have also been 
subject to criticism, and there is little evidence that such tests are more powerful than other exact tests, 
such as those based on likelihood-ratios |Eng09| or the root-mean-square. 

In this article, we demonstrate using the classical datasets from Guo and Thompson |GT92| that 
goodness-of-fit tests based on the root-mean-square distance can be more powerful than all of the classic 
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1 



tests at detecting deviations from Hardy- Weinberg equilibrium, by up to an order of magnitude. Our 
results should not be confused with good luck or anomaly. Upon further analysis of the datasets, it is 
revealed that the classic tests, tuned to detect relative discrepancies, can be blind to overwhelmingly large 
discrepancies among common genotypes that are drowned out by expected finite-sample size fluctuations 
in rare genotypes. The root-mean-square statistic, on the other hand, is tuned to detect deviations in 
absolute discrepancies, and easily detects large discrepancies in common genotypes. 

To make these observations precise, we show that in the asymptotic limit where the number of draws 
and number of alleles tend to infinity together, the root-mean-square statistic has asymptotic power 
one while the classic statistics have asymptotic power zero against a family of alternatives involving one 
common allele, several rare alleles, and an excess of observed common genotypes. We observe numerically 
that this asymptotic limit is reached very quickly, on datasets involving no more than ten alleles and 30 
draws. In the classical asymptotic limit, where the number of draws tends to infinity but the number of 
alleles remains fixed, the root-mean-square statistic is a linear combination of Gaussian random variables; 
as shown in [PTWllbl IPTWllcl IPTWlla] , the root-mean-square statistic is often more powerful in this 
asymptotic limit as well as others. 

We keep in mind that none of the statistics we consider produces a test that is uniformly more 
powerful than any other, but while each statistic focuses power on its own class of alternatives, we 
maintain that the root-mean-square statistic is more relevant for many deviations of interest in practice. 
At the very least, the root-mean-square statistic and the classic statistics focus on complementary classes 
of alternatives, and their combined P-values provide a more informative test than either P-value used 
on its own. 

The results of our analysis are consistent with the numerous experiments conducted in the recent 
work [FTWlla] . which highlight the power of the root-mean-square statistic over classic statistics in 
detecting meaningful discrepancies in nonuniform distributions. The recent paper |Tygl2| provides sev- 
eral representative examples for which the root-mean-square test is more powerful than Fisher's exact 
test for homogeneity in contingency-tables. We should remark that the root-mean-square statistic is not 
completely unrelated to other statistics used in practice; as shown in [CTW12] . the root-mean-square 
statistic can be interpreted as an analog to the discrete Kolmogorov-Smirnov statistic for nominal data. 

This article is structured as follows: in Section 2 we recall the set-up and motivation for testing 
Hardy- Weinberg equilibrium. We describe the relevant test statistics in Section 3, and we compare the 
performance of these statistics on the classic datasets from Guo and Thompson in Section 4. We provide 
an asymptotic analysis of the various statistics in Section 5 to highlight the limited power of the classic 
statistics compared to the root-mean-square statistic in distinguishing important classes of deviations 
from Hardy- Weinberg equilibrium. 

2 Hardy- Weinberg equilibrium: set-up and motivation 

Recall that a gene refers to a segment of DNA at a particular location (locus) on a chromosome. The 
gene may assume one of several discrete variations, and these variants are referred to as alleles. An 
individual carries two alleles for each of her autosomal genes — one allele selected at random from the 
pair of alleles carried by her mother, and one allele selected at random from the pair of alleles carried 
by her father. These two alleles, considered as an unordered pair, constitute the individual's genotype. 
A gene having r alleles Ai, A2, . . . , Ar has r(r + l)/2 possible genotypes. These genotypes are naturally 
indexed over a lower-triangular array as in Figure [l] 

A population is said to be in Hardy- Weinberg Equilibrium (HWE) with respect to the system of alleles 
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Figure 1: Enumeration of genotypes for a gene having r alleles Ai, A2, . . . , A^. 



if the proportion of individuals in the population with two distinct alleles is twice the product of the 
allele proportions and the proportion of individuals in the population with two copies of the same allele 
is the square of that allele's frequency in the population. That is, if Pj^k is the relative proportion of 
genotype {Aj,Ak} in the population, and if 9k is the proportion of allele Ak in the population, then the 
system is in equilibrium if 

m J j>k 

{ oi, J = k. 

A large population of genotypes satisfying Hardy- Weinberg equilibrium will remain in Hardy- Weinberg 
equilibrium assuming random mating and no disturbing forces (e.g., no selection, no mutation, no mi- 
gration, and so on). Moreover, Hardy- Weinberg equilibrium is neutral: if any assumptions are violated 
in a particular generation and a population is perturbed, then in the next generation the population will 
be in a new equilibrium (assuming assumptions are restored) specified by the new allele proportions. 
Hardy- Weinberg equilibrium is then a robust and reliable certificate that a population is not evolving 
with respect to the gene of interest, and the detection of deviations from Hardy- Weinberg equilibrium is 
paramount in genetic analyses. 

3 Testing for deviations from Hardy- Weinberg equilibrium 

In practice, one rarely has access to the genetic profile of every individual in a population, and statis- 
tical inference must be based on a random sampling of the population. If a population of genotypes 
{Aj , Ak} with underlying genotypic proportions pj^k is sufficiently large, a random sample of n genotypes 
Xi,X2,- ■ -Xn from this population can be regarded as a sequence of independent and identical draws 
from the multinomial distribution specified by probabilities 

Prob^Xi = {A,-, Afc}) = p,-,fc, l^k^j^r. (2) 

If rij^k realizations of genotype {Aj, A^} are observed in the sample of n genotypes, then the number of 
instances of allele A^ in the observed sample of 2n alleles is 

r j 

nj = XI '^fc.J + S J = '^,---,r. (3) 

k—j k—1 

In order to gauge the consistency of the sample counts (rij.k) with Hardy- Weinberg equilibrium, we must 
first specify the r — 1 free parameters Si, 02, ... , Or-i corresponding to the underlying allele proportions 
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in the HWE model ([T]). Intuitively, the observed proportions of alleles, ni/(2n),n2/(2n), . . . ,n,.-i/(2n), 
serve as our best estimates for 6\,92, ■ ■ ■ , 0, — i; these parameter specifications give rise to the model counts 
of genotypes under Hardy-Weinberg equilibrium. 



{njnk)/{2n), j > k 



ruj^k = mj,k{ni,n2, . . . , n^) 



n,V(4n), j = k. 



(2 - Sjk){nj nfe)/(4n), 



(4) 



where Sjk is the Kronecker delta function. 




0, if j ^ fc 

1, if j = k. 



It is not difficult to verify that the observed proportions of alleles are indeed the maximum-likelihood 
estimates for the underlying parameters 6i,02, ■ . ■ ,9, — i in the family of HWE equilibrium equations ([l]). 

The observed counts of genotypes rij^ft in a random sample from a population in Hardy-Weinberg 
equilibrium should not deviate too much from their model counts rrij^k- However, a systematic approach 
is needed to distinguish inevitable finite-population and finite-sample size fluctuations from model mis- 
match. Without additional prior information, a goodness-of-fit test serves as an omnibus litmus test to 
gauge consistency of the data with the Hardy-Weinberg equilibrium model. Ideally, the goodness-of-fit 
test should be sensitive to a wide range of possible local alternatives; more realistically, several different 
goodness-of-fit tests can be used jointly, each sensitive to its own class of alternatives. If a nonpara- 
metric test as such indicates deviation from equilibrium, different parametric tests can then be used to 
elucidate particular effects of the deviation such as directions of disequilibrium or level of inbreeding. 
Several parametric Bayesian methods have been proposed, and we refer the reader to the discussions in 
ICT99I [SPW981 IAB98I IlNF+09[ ILGMI ICMV11| . In this paper we will focus only on nonparametric (or 
nearly nonparametric) tests of fit, but we emphasize that goodness-of-fit tests should be combined with 
Bayesian approaches and other types of evidence for and against the HWE hypothesis before drawing 
final inference. 

3.1 Goodness-of-fit testing 

A goodness-of-fit test compares the model and empirical distributions using one of many possible mea- 
sures. Three classic measures of discrepancy, all special cases of Cressie-Read power divergences, are 
Pearson's x^-divergence 




(5) 



the log-likelihood-ratio or g 



divergence. 




(6) 



and the Hellinger distance 




(7) 
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The end result of a goodness-of-fit test is the P-value, the probabihty of observing a discrepancy between 
model and sample proportions of genotypes at least as extreme as the measured discrepancy, under the 
null hypothesis of i.i.d. draws from the model. If a goodness-of-fit test returns a sufficiently small P-value 
— e.g. .01 or .001 — then one can be highly confident that the model assumptions do not hold. A more 
powerful measure of discrepancy for a given data set will produce a smaller P-value if the null hypothesis 
is not satisfied. We remark that there are subtleties involved with the definition and interpretation of 
Pvalues, as discussed, for example, in Section 3 of [PTWlla] . 

In this paper, we distinguish two types of commonly-used P-values, which we refer to as the plain 
P-value and fully conditional P-value. We remark that one could also consider Bayesian P-values [Gelj . 
among other formulations. 

To compute the plain P-value, one repeatedly simulates n i.i.d. draws from the model multinomial 
distribution {rrij^k/n) . For each simulation i, the genotype counts A'j'^ , allelic counts A'^-'^ = ( X;fc=j -^fe'] + 
^i^i Nj'l) , allelic proportions O^'' = Aj'-'/(2n) , and equilibrium model counts associated to this sample, 
Mj^l = (2 — 5_,,fc)A'j'-' A^'-'/(4n), are computed. The plain P-value is the fraction of times the discrepancy 
between the simulated counts (Aj'^) and their model counts {Mj'l) is at least as large as the measured 
discrepancy between the observed counts rij^k and their model counts mj,^. 

The fully conditional P-value corresponds to imposing additional restrictions on the probability space 
associated to the null hypothesis. To compute the fully conditional P-value, the observed counts of alleles, 
ni, . . . ,nr, are treated as known quantities in the model, to remain fixed upon hypothetical repetition 
of the experiment. This would hold, for example, if the sample population used in the experiment were 
the entire population of individuals. More specifically, one repeatedly simulates n i.i.d. draws from the 
hypergeometric distribution that results from conditioning the multinomial model distribution (mj^k/n) 
on the observed allele counts, Ni = ni, A2 = n2, . . . , Nr = n,.. In |GT92] . Guo and Thompson provided 
an efficient means for performing such a simulation: apply a random permutation to the sequence 

ni 712 

A = I Ai,Ai"t. . . , Ai, A^.,A2, ...,a77?7a^ |, (8) 

and identify the pairs {A2j, A2J+1}. The fully conditional P-value is the fraction of times the discrepancy 
between the simulated counts (Nj'l) and the model counts (rrij^k) is at least as large as the measured 
discrepancy. 

Pseudocode for calculating plain and fully conditional P-values is provided in Algorithms |5.1| and |5.2| 
of the Appendix. 

Remark 3.1. It is important to note that P-values computed by repeated Monte-Carlo simulation 
are really exact: Given any specified precision £, Hoeffding's inequality guarantees with 99.9% certainty 
that the Pvalue obtained using I simulations will equal the P-value P obtained using infinitely many 
simulations to within precision e, as long as the number of Monte Carlo simulations exceeds I = 4P(1 — 
P)/e^. In all of our experiments, we used I = 16, 000, 000 Monte Carlo simulations so that the reported 
three digits of precision in our P-values are correct with 99.9% certainty. 

Remark 3.2. The r — 1 parameters in the family of HWE distributions ([T| are often referred to as 
nuisance parameters. The fully conditional test we describe is a variant of the conditional test pro- 
posed by R.A. Fisher for dealing with nuisance parameters in the context of contingency table analysis 
|Fis25l IMP83| , and amounts to conditioning on a minimally sufficient statistic for estimating the nui- 
sance parameters. Conditioning in the context of HWE testing dates back to the wor ks of [Lev49llHal54| . 
but was not considered feasible for large data sets until Guo and Thompson derived the aforementioned 
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method for efficiently simulating draws from the conditional distribution. Note that while conditioning 
on the counts of alleles in the observed population does effectively remove parameters from the null 
hypothesis, it also imposes additional assumptions on the experiment that are not necessarily reflective 
of reality. In small samples drawn from a large genotype population, the allele counts are not known a 
priori and estimates of the allele counts are subject to change upon repetition of the experiment. Unlike 
the fully conditional P-value, the plain P-value takes this into account; for more details, see Section 3 of 
IPTWlla) . 

3.2 The negative log-likelihood statistic 

A popular alternative to testing Ifardy- Weinberg equilibrium with the power divergence discrepancies 
, , and is to use a discrepancy based directly on the likelihood function for the multinomial 
distribution, 

C{nj,k\ n, mj,k) = Prob^^iVi,! = ni,i, A''2,i = n2,i, . . . , iVr.r = n^,,-) 



— II I "^1 1 '"'1 2 ■ • • '"T^r ■ 

ni,i!ni,2! • . . Wr.rln" • 

Because the likelihood function has an excessively large dynamic range, the negative of the logarithm of 
the likelihood is instead used as a test statistic: 

I = -logic). (10) 

Because the logarithm is nondecreasing over its domain, the negative likelihood function and negative 
log-likelihood function produce the same P-value, and this P-value can be interpreted as the probability 
of observing data with equal or lesser likelihood than that observed under the null hypothesis. The 



negative log-likelihood function (10 1 looks similar to the log-likelihood-ratio function g^, but there is an 



important distinction to be made: the log-likelihood-ratio, which sums the logarithms of ratios between 
observed and expected counts, is a proper divergence. The negative log-likelihood function is not a 
divergence, and this results in several undesirable properties that have led many to criticize its use 
|GP75llRX75] |Eng09J. 

The negative log-likelihood function does have something in common with the power-divergence dis- 
crepancies: under the null-hypothesis, the negative log-likelihood statistic L and the power divergence 
statistics X^, G^, and all become a chi-square random variable with r(r — l)/2 — 1 degrees of freedom 
as the number of draws n goes to infinity and number of alleles remains fixed |Bro65| . Before comput- 
ers became widely available, using a statistic with known asymptotic approximation was necessary for 
obtaining any sort of approximate P-value. The exact (non-asymptotic) P-values for these statistics or 
any other measure of discrepancy can now be computed effortlessly using Monte-Carlo simulation. 

3.3 The root-mean-square statistic 

A natural measure of discrepancy for goodness-of-fit testing which has not received as much attention in 
the literature is the root-mean-square distance, 

■^-(;^M^ S (11) 
The square of the root-mean-square distance is proportional to Pearson's discrepancy when the 
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model distribution is uniform, but takes on a very different character when the model distribution diverges 
from uniformity. Note that in practice, multiallelic distributions of genotypes are often very nonuniform, 
due to the presence of a few common alleles and several rare alleles. 

In contrast to the classic statistics, the asymptotic distribution for the root-mean-square statistic 
F in the limit of infinitely many draws and fixed alleles, while completely well-defined and efficient to 
compute, depends on the model distribution [PTWllbl IPTWllc] . This has likely contributed to its 
underrepresentation in the literature, as much of the classical statistical methodology was canonized 
before computers became readily accessible. Using the pseudocode provided in Algorithms |5.1| and |5.2| 
we can now obtain exact P-values for the root-mean-square statistic just as easily as we can compute 
exact P-values for any of the classic statistics. 



4 Numerical results 

We are now ready to compare the performances of the root-mean-square statistic and the classic statistics 
in detecting deviations from Hardy- Weinberg equilibrium. We evaluate the performance of the various 
statistics on three benchmark datasets from Guo and Thompson [GT92] . The three datasets, which we 
refer to as Examples 1, 2, and 3, are represented in Figure [2] as lower-triangular arrays of counts. The 
bold entry in each cell corresponds to the number rij^k of observed counts of genotype {Aj,Afe} in the 
sample, and the second entry in each cell corresponds to the expected number rrij^k of counts under 
HWE. 

For each example, and for each of the five statistics X^, G^, , L, and F, we calculate both the plain 
and fully conditional P-values using 16, 000, 000 Monte-Carlo simulations for each calculation. Recall 
that a small P-value P lets us infer, with (100(1 — P))% confidence, that the draws are not i.i.d. or the 
draws are inconsistent with the HWE model. 

The results of the analyses of Examples 1,2, and 3 — displayed in Tables 1, 2, and 3 — suggest that 
for both plain and fully conditional exact tests of goodness-of-fit, the root-mean-square statistic can be 
significantly more powerful than the classic statistics in detecting deviations. 

Figures [3] |4] and [5] contain boxplots displaying the median, upper and lower quartiles, and whiskers 
reaching from the 1st to 99th percentiles for relative root-mean-square discrepancies and relative chi- 
square discrepancies simulated under the plain Hardy- Weinberg equilibrium null hypothesis for the 
datasets from Examples 1, 2, and 3. The boxplots are for simulated data, whereas the large open 
circles indicate the observed data. For a detailed description of these plots, we refer the reader to the 
Appendix. In the chi-square boxplots, we see the division by expected proportion in the summands of 
the chi-square discrepancy ([5| reflected in the larger contribution of relative discrepancies to the reported 



P-values; in contrast, we see the equal- weighting of the summands of the root-mean-square distance (111 
reflected in the larger contribution of absolute discrepancies to the reported root-mean-square P-values. 
In Section [5j we will see that all of the classic statistics, not just the chi-square statistic, are sensitive to 
relative rather than absolute discrepancies. 



4.1 Interpretation of the results for Example 1 

Comparing the boxplots in Figure [3j we see that both chi-square and root-mean-square tests report a 
statistically significant deviation in the largest index, among others. The largest index corresponds to the 
18 observed counts versus 10 expected counts of genotype {A3, A2} in Example 1. However, the P-value 
reported by the root-mean-square test is an order of magnitude smaller than the P-value reported by 
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Table 1: P- values with 99.9% confidence intervals for Pearson's statistic , the log-likelihood-ratio statistic 
G'^, the Hellinger distance the negative log-likelihood statistic L, and the root-mean-square statistic F, for 
the observed genotypic counts in Example 1 to be consistent with the Hardy- Weinberg equilibrium model (Ul. 



Statistic 


plain Pvalue 


fully conditional Pvalue 




.693 +.001 


.709 +.001 


G2 


.600 +.001 


.630 +.001 




.562 +.001 


.602 +.001 


L 


.648 +.001 


.714 +.001 


F 


.039 + .001 


.039 + .001 



chi-square test, as this discrepancy is larger compared to expected root-mean-square fluctuations than it 
is compared to expected chi-square fluctuations. In the chi-square summation, the statistical significance 
of this deviation (as well as the deviations in indices 6 and 7) is washed out by large expected relative 
deviations in the rare genotypes. 

4.2 Interpretation of the results for Example 2 

The distribution of discrepancies in Figure [4] can be interpreted similarly to the boxplots from Figure 
[3] both the chi-square and root-mean-square tests report a statistically significant deviation in the 
5th-largest index, corresponding to the 982 observed counts versus 1057.6 expected counts of genotype 
{A4,Ai} in Example 2. However, the P-value reported by the root-mean-square test is an order of 
magnitude smaller than the P-value reported by chi-square test, as this discrepancy is larger compared 
to expected root-mean-square fluctuations than it is compared to expected chi-square fluctuations. In 
the chi-square summation, the statistical significance of this deviation is washed out by large expected 
relative deviations in the rare genotypes. In contrast to the n = 45 draws from Example 1, this dataset 
contains n = 8297 draws; we infer that the qualitative differences between the root-mean-square and 
chi-square statistic are not unique to small sample-size data. 

4.3 Interpretation of the results for Example 3 

Comparing the expected and observed chi-square discrepancies in Figure [5|b), we might posit that the 
small P-value of .015 + .001 that the chi-square test gives to the data in Example 3 depends strongly on 
the discrepancy at the 4th index on the plot, corresponding to a single draw of genotype {Ae, As}. By 
removing this draw from the dataset and re-running the chi-square goodness-of-fit test on the remaining 
n = 29 draws, the chi-square statistic returns a P-value of .207+. 001, well over an order of magnitude 
larger than the previous P-value, confirming that the small P-value given by the chi-square statistic for 
the dataset in Figure 2(a) is the result of observing a single rare genotype. The root-mean-square statistic 
is not as sensitive to this discrepancy. 



5 An asymptotic power analysis 

In this section we give theoretical justification to our assertion that the root-mean-square statistic can 
be more powerful than the classic statistics in detecting deviations from Hardy- Weinberg equilibrium. 
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(a) Example 1: n = 45. 
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(b) Example 2: n = 8297. 
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(c) Example 3: n = 30. 



Figure 2: The three datasets from Guo and Thompson |GT92j . Observed counts are in bold and 
model counts are below. 
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Table 2: P-values with 99.9% confidence intervals for Pearson's statistic , the log-Ukelihood-ratio statistic 
, the Hellinger distance , the negative log-Hkehhood statistic L, and the root-mean-square statistic F, for 
the observed genotypic counts in Example 2 to be consistent with the Hardy- Weinberg equilibrium model (Ul. 



Statistic 


plain Pvaluc 


fully conditional Pvalue 




.020 +.001 


.020 +.001 




.013 +.001 


.013 +.001 




.027 +.001 


.025 +.001 


L 


.016 +.001 


.018 +.001 


F 


.002 + .001 


.002 + .001 



Table 3: P-values with 99.9% confidence intervals for Pearson's statistic X^, the log-likelihood-ratio statistic 
G^, the Hellinger distance H^, the negative log-likelihood statistic L, and the root-mean-square statistic F, for 
the observed genotypic counts in Example 3 to be consistent with the Hardy- Weinberg equilibrium model 



Statistic 


plain Pvalue 


fully conditional Pvalue 




.015 + .001 


.026 + .001 




.181 +.001 


.276 +.001 




.307 +.001 


.449 +.001 


L 


.155 +.001 


.207 +.001 


F 


.885 + .001 


.917 + .001 



0.8 

0.6 



o 



6 6® 



(1) 

9 



I 6 f 



Genotypes from Example 1 , ordered by increasing model count 

(a) Expected vs. observed relative root-mean-square discrepancies for the data in 
Example 1 
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Genotypes from Example 1 , ordered by increasing model count 
(b) Expected vs. observed relative discrepancies for the data in Example 1 



Figure 3 
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Genotypes from Example 2, ordered by increasing model count 

(a) Expected vs. observed relative root-mean-square discrepancies for the data in Example 2 
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Genotypes from Example 2, ordered by increasing model count 

(b) Expected vs. observed relative discrepancies for the data in Example 2 

Figure 4 
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Genotypes from Example 3, ordered by increasing model count 

(a) Expected vs. observed relative root-moan-square discrepancies for the data in Example 3 
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Genotypes from Example 3, ordered by increasing model count 

(b) Expected vs. observed relative discrepancies for the data in Example 3 



Figure 5 
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We will show for a representative family of datasets that the root-mean-square statistic has asymptotic 
power one while the chi-square statistics have asymptotic power zero. To model the setting where the 
number of draws and number of genotypes are of the same magnitude, we consider the limit in which 
the number of alleles and number of draws go to infinity together. Note that the asymptotic chi-square 
approximation to the classic statistics is not valid in this limit. 

We consider a gene having r + 1 alleles, one common allele and r rare alleles. The Common Allele 
dataset we consider involves n = Sr observed genotypes, distributed as indicated below. 

Table 4: Common Allele dataset 



n = 


3r observed genotypes 






ni,i 


= r of type {Ai, Ai}, 


ni^i/n = 


1/3 




= 2 of type {Ai, Afc}, 


niM/n = 


2/(3r), 2^k^r + l 


nj,k 


= of type {Aj, Afc}, 


Hj^k/n = 


0, 2sgjsgfc<r + l 


ni = 


= 4r alleles of type Ai , 


7ii/(2n) 


= 2/3 


rik = 


= 2 alleles of type Afc, 


rtfc/(2n) 


= l/(3r), 2s;fcs;r + l. 



The maximum-likelihood model counts for the Common Allele dataset are 



mi,i = 4r/3, 
mi,fc = 4/3, 
mk,k = l/(3r), 
mj,fc = 2/(3r), 



2 s; fe s; r + 1, 
2 s£ fc s£ r + 1, 
25£j<fc5£r-l-l, 



(12) 



j < k. 



To see that the Common Allele dataset becomes increasingly inconsistent with the Hardy- Weinberg model 
as r increases, observe that under the null hypothesis, we would expect in a sample of n = 3r genotypes 
to see r/3 = J^^jZl 21=2 "^i.fc genotypes containing only rare alleles. The Common Allele dataset however 
contains no genotypes containing only rare alleles. In spite of this inconsistency, we will prove that the 
plain P-values for each of the four classic statistics X^, , H^, and L converge to 1 as r —>■ co, indicating 
zero asymptotic power. In contrast, the P-value for the root-mean-square statistic converges to zero. 

Theorem 5.1. In the limit as r co, the plain P-values (as computed via Algorithm 5.1) given by , 
the log-likelihood-ratio statistic , the Hellinger distance , and the negative log-likelihood statistic L 
for the Common Allele dataset to be consistent with the Hardy -Weinberg equilibrium model all converge 
to 1, while the plain P-value for the root-mean-square statistic converges to 0. 

The crux of the proof is that, as r increases, relative fluctuations in the rare genotypes simulated under 
HWE become sufficiently large that the sum of relative discrepancies expected under the null hypoth- 
esis exceeds the sum of the observed relative discrepancies. However, the sum of absolute fluctuations 
expected under the HWE model remains bounded below the sum of the observed absolute discrepancies. 

In the proof of Theorem |5.1| we will use the notation it„ > Vn to indicate that there exists some 
absolute constant C > such that Un ^ Cv„ for all n = {1,2,...}. We use the notation u < v 
accordingly. We will use C > to denote a positive universal constant that might be different in each 
occurrence. We write X{r) y to mean that the distribution X{r) converges to the value y as r — > oo. 



Proof of Theorem \5.1\ Recall the relevant notation for computing plain P-values in Algorithm |5.1[ along 
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1 10 20 30 40 50 
Number of rare alleles in Common Allele Dataset 



1 10 20 30 40 50 
Number of rare alleles in Common Allele Dataset 



Fi gurc 6l P- values (accurate to three digits with 99% confidence) for Pearson's statistic AT"^, the log— Ukelihood- 
ratio statistic G^, the HcUinger statistic H^, the negative log-likelihood statistic L, and the root-mean-square 
statistic _F, for the observed genotypic counts in the Common Allele dataset to be consistent with the Hardy- 
Weinberg equilibrium model (Ul, as a function of the number of alleles r. 



with the Common Allele dataset in Table 4 and its maximum-likelihood HWE model counts (121. Here 
and throughout, we will refer to Ai as the common allele and to as the common genotype; we 

will refer to the remaining r alleles as rare, to genotypes of the form {j4i,ylj}, 2 ^ j ^ r -I- 1, as rare 
observed genotypes, and to genotypes of the form {Aj, At}, 2^j^k^r+l£iS unobserved genotypes. 

1. Because the model proportion 6i = 2/3 remains constant as r increases but the number of draws 
n = 3r tends to infinity, the law of large numbers implies that Qi 6i = 2/3. Accordingly, 
Mi^i/n — > mi,i/n = 4/9 and X;j=2 9j = 1 ^ ©i ~> 1/3. In words, eventually 2/3 of the simulated 
alleles and 4/9 of the simulated genotypes from the model will be common. 

2. Similarly, Y.ltl M,,i/n ^ T.ltl mi.k/n = 4/9 and YI,tl M^Jn YZ\ T^^tl ^kju = 1/9. 
In words, roughly 4/9 of the draws simulated from the model will be rare observed genotypes, while 
1/9 of the simulated draws will unobserved genotypes. 

3. With probability approaching 1 as r — > cxj, each of the roughly n/9 = r/3 simulated draws from 
the pool of (r'^ — r)/2 unobserved genotypes will have a different genotype from the others. At this 
point, roughly r/3 of the unobserved simulated proportions Nj^t/n, 2$;fc$;j^r-l-l, will equal 
l/(3r), while the others will equal 0. 

4. The coupon collector's problem (see, for example, [MR95j l implies that with probability approach- 
ing 1 as r — > cxD, among the roughly 2r simulated draws from the pool of r rare alleles, no rare allele 
will be drawn more than log(r) times (fixing the base of the logarithm at any real number greater 
than 1 that does not depend on r), and at least 3r/4 among the r rare alleles will be drawn at least 
twice. 

In particular, the last point above implies that, with probability approaching 1 as r — > oo, all of the 
simulated rare proportions Oj = Oj{r), 2 ^ j ^ r + 1, will satisfy 

e,(r) ^log(r)/r (13) 
and, for at least 3r/4 among the r simulated rare proportions, 

l/(3r) s: e^(r) s: log(r)/r. (14) 
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1. The P- value for the root-mean-square goes to when r — > cxd. The measured sum-square 
discrepancy = lll^tllj^ (jgtween the observed proportions rij^t^/n and the model proportions 
rrij^t^/n is 



r + l 



■1\2 4 1 2(r-l) 



2 



Sir 81r» 81r3 



(15) 



As r — > 00, 



I- \. (16) 
If we instead consider the sum-square statistic = iljhiUljt^T?^ j-ggyi^^ijjg from drawing n = 3r 



genotypes i.i.d. from the model distribution (12 I, points 1, 3, and 4 above give 
p2 < (iVi,i - 4r/3)^ y /(logr 

9^2 Zj 1, J. 

2s;fc!Sjs;r + l:]Vjj,=l 2^ fcsjj s;r+ 1: JV^ j, =0 

Z'^ I (log^)"^ I ^ I (^ ^(^+1) 7-wlogrN4 
27r/4 r 3 9r2 I 2 3A r 7 ' ^ ^ 

where Z = (iVi,i - 4r/3)/v'4r/3 converges in distribution to a standard normal distribution as 
r ^ 00. Therefore, as r — > co, 

F 



Combining ( 16 1 and ( 17 1 shows that the P-value for the root-mean-square statistic, P = Prob{_F ^ 
/} = Prob{_F 5j /}, goes to as r — > 00. 

2. The P-value for goes to 1 as r -* cxj. Similar to the measured sum-square discrepancy /, 
the measured discrepancy = x^ jn converges to some finite positive real number as r — > 00. 
Alternatively, if we simulate n = 3r genotypes from the model distribution and (following point 
3 above) consider only those roughly r/3 summands in the normalized x^ statistic = jn 
corresponding to the unobserved genotypes with one simulated draw, 

X > - mm \ —] /' 

3 2s;fcs;js:r+i:]Vjj-=i V n n J 

^ r/1 ^logr^ 2 ^^^ogr^ 2 
~ 3 V3r " 



/ Fr^ • (18) 



It follows that > fi^^^yi — * 00, and so the P-value for the x^ statistic, P = Prob(X^ ^ x^) 



Prob(X^ )i goes to 1 as r — > cxd. 



3. The P-values for the log— likelihood-ratio and negative log— likelihood L go to 1 when 

r — > cxj by an argument analogous to that used for the x^ P-value. 

4. The P-value for the Hellinger statistic goes to 1 when r — > 00. We have to be a bit more 
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careful with the analysis of the Hellinger discrepancy = h? /{4,'n). The observed discrepancy is 



(V3-2) = 



(V3-2)^ 



.14., 



J=2 

10 - 4V6 1 
^ + 9 



4 n2 
Or/ 



r+1 ^ 



9^.2 Z-1 
2!Sfc<j!Sr+l j=2 



(19) 



Alternatively, suppose we simulate n = 3r genotypes from the model distribution and consider r 
sufficiently large. Each estimated rare allele proportion will be bounded: Qj ^ log(r)/r, as stated 
in ( |13[ ). Furthermore, by ( |14[ ), at least 3/4 of these proportions will satisfy Qj ^ l/(3r), ensuring 

that at least ^^^'^^ ^ r among the r(r + l)/2 simulated proportions for the unobserved genotypes 

satisfy Mj^k/n ^ 2/(9r^). Then, for sufficiently large r. 



2^ fcsjr + l 



^ #{j,fc:iV,,fe = l}( 



log (r) N 2 



1 
3r 



(©\--#Ufc:iV..-l})( 



'3n2 
.47 2 
r / 1 

.17.... 



log r 



9r2y 
3) (9H. 



(20) 



Combining (19 1 and (20 1, we conclude that the P- value for the Hellinger distance. 



P = Prob(_H'^ 5= /i^) = Prob(7i"^ ^ ft^), goes to 1 as r ^ 00. 



□ 



Figure |6] shows that the convergence of the classic P- values to 1, and of the root-mean-square P- value 
to 0, occurs very quickly. This convergence is demonstrated for both the plain and fully conditional 
P- values, even though Theorem |5 . 1 1 applies directly only to the plain P- values. 

To conclude this section, we remark that the particular distribution of the draws in the Common 
Allele dataset was rather arbitrary, and that a similar asymptotic analysis holds for many other datasets. 
For example, we could have considered instead a dataset involving two, three, or four common alleles, or 
one common allele and three fairly-common alleles, and so on. 
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Appendix I: Pseudocode for calculating exact P-values 

Algorithm 5.1: Computing the plain P- value 



Input: Observed genotype counts Uj k, number of Monte Carlo simulations £, and test statistic 
S {e.g. S = X^G^H^...) 

Output: plain P-value associated to test statistic S{nj^k,rnj^k) 



Compute maximum-likelihood model counts rrij^k = (2 — (5jfe)(%- ?T.fe)/(4n) 
Measure the discrepancy s = S{nj^k,'m-j,k). 

i ^ 
repeat 

- i ^ i + 1 

- Draw n genotypes x[^\ . . . , Xq^\ . . . , Xn^ i.i.d. from the multinomial model distribution 

- Aggregate simulated genotype counts ivj'^ = ; Xq ^ = {Aj,Ak}} 

- Aggregate simulated allele counts N^^^ = {Yik=j ^k] + Sl—i ^jfc) ^^"^ proportions 

qW ^ ivj''V(2n). 

- Compute maximum- likelihood counts iwj^ = (2 - Sjk)N!j'^ xj^^ /{in) 

- Evaluate simulated discrepancy Si = S{Nj^l, Mj^l) 
until i = £ 

return plain P- value, P = ^{i : Si 5= s}/i 



Appendix II: Description of Figures [3], [4], and [5] 

Consider for a sample of genotype counts the linear ordering given by the nondecreasing rearrangement 
of the Hardy- Weinberg equilibrium model counts: if m[j] denotes the jth smallest expected frequency 
among all the model genotype frequencies, 1 ^ j r(r -I- 1)/2, then we denote the corresponding number 
of draws by n^j], and the corresponding number of observed and expected simulated draws under the 
(plain) HWE null hypothesis by Ny] and My] . 

The observed root-mean-square discrepancies are 

= (m[,] -n[,])^ (21) 

while the observed chi-square discrepancies are 

dr = ^^^^^^""^^'J^'. (22) 

The random vectors of expected root-mean-square discrepancies in n i.i.d. draws from the model distri- 
bution are 

= {My]-Ny])\ (23) 
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Algorithm 5.2: Computing the fully conditional P-value 



Input: Observed genotype counts rij.fe and allele counts Uj, number of Monte Carlo simulations 
i, and test statistic S (e.g. S = X'^,G'^, H^, . . .) 

Output: fully conditional P-value associated to test statistic S{nj^k,'n^j.k) 



Compute maximum-likelihood model counts mj_k = (2 — 5.jk)njnk/{An). 
Measure the discrepancy s = S{nj^k,'mj,k)- 

i ^ 
repeat 

- i ^ i+l 

- Apply a random permutation to the sequence of alleles as in ([s]) to obtain n simulated 
genotypes X^^ , . . . , x'^^ , . . . , Xn^ with fixed allele counts Uj . 

- Aggregate simulated genotype counts ivj'] = #{g : Xq ^ = {Aj.Ak}} 

- Evaluate simulated discrepancy Si = S{Nj^l,mj^k) 
until i = £ 

return fully conditional P-value, P = #{« : Si > s}/£ 



and ^ 

^ My] 

To generate the boxplots for the relative root-mean-square discrepancies, we simulated K = 1000 re- 
alizations of n i.i.d. draws from the HWE model in the respective examples. For each simulation, we 



computed the vector of root-mean-square discrepancies (231 and normalized the vector to sum to 1. We 
displayed the distribution of discrepancies using a boxplot: for each term j, the median of the distribution 
Dj ' = (Dj*')J£°'' is indicated by the bulls-eye mark ©. The rectangular box around the median extends 
to the 25th and 75th percentiles of the data, and the whiskers extending from each side of the box reach 
out to the 1 and 99th percentiles of the data. On top of the boxplot, the observed discrepancies, d™" , 
normalized to sum to 1, are indicated by large open circles. 

The chi-square plot for each figure was created by repeating the same set-up as above using the 
relative chi-square discrepancies. 
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